Abstract

messages on a square mesh with N processors. A novel sorting algorithm for
unidirectional rings achieves the first lower bound.

1. Introduction

To cooperate to solve a problem, the processors in a distributed computing
system must communicate among themselves. For both large computer networks
and VLSI architectures, however, the inclusion of a shared memory to
facilitate interprocessor communication is usually infeasible. The processors
in these distributed systems can communicate only by sending messages via a
network. Thus to exploit fully the potential efficiency of a distributed
system, an efficient algorithm should minimize the message traffic in order to
minimize the computation time.

The problem of finding an extremum (also called electing a leader) in a
distributed system is well solved (Dolev et al., 1982; Matsushita, 1983;
Peterson, 1982). Efficient distributed algorithms have also been proposed for
determining medians (Frederickson, 1983; Matsushita, 1983; Rodeh, 1982;
Santoro and Sidney, 1982), minimum spanning trees (Gallager et al., 1983),
shortest paths (Chandy and Misra, 1982), and maximum flows (Segall, 1982).

It is natural to ask whether these algorithms achieve the smallest possible
message traffic for each problem. Let lg denote the logarithm taken to
base 2. For the extrema-finding problem Burns (1980) established a lower
bound of 0.25 N lg N messages in the worst case on a bidirectional ring.
Pachl et al. (1982) proved that 0.693 N lg N messages are necessary on the
average on a unidirectional ring. Apparently, lower bounds for no other
problems have been discovered.

In this paper I derive lower bounds on the number of messages required to
arrange N values into sorted order. Let the values be in {0, ..., L}. Every
sorting algorithm requires $\Omega(N^2 \lg(L/N) / \lg N)$ messages on a
bidirectional ring with N processors. Every sorting algorithm requires
$\Omega(N^{3/2} \lg(L/N) / \lg N)$ messages on a square mesh with N
processors. Evidently fewer messages are necessary if L is small than if
L is large.

Furthermore, I present a simple sorting algorithm on rings that achieves
the $\Omega(N^2 \lg(L/N) / \lg N)$ lower bound. Within a constant
multiplicative factor, this algorithm is optimal. The algorithm of Korach
et al. (1982) uses $O(N^2)$ messages to rank the values in a network, but
does not rearrange the values.

Section 2 defines the computational model and the sorting problem. Section 3
establishes the lower bounds on message complexity. Section 4 describes the
optimal sorting algorithm for rings, and Section 5 presents other sorting
algorithms.


This paper adopts the model of distributed computation developed by Santoro
(1981). The model is asynchronous, requires decentralized control,
admits no shared memory, and permits data transfers only on a communication
network.

The distributed computing system comprises N identical processors connected
via a communication network. A link is an ordered pair of processors,
and a network is a set of links. Processor x can send a message directly to
processor y if and only if link (x, y) is in the network.

Every processor runs the same program. Initially, each processor knows
only the links that involve it and the overall topology of the network, for
example, whether the network is a ring or a mesh.

Each of the processors has a distinct number representable with O(lg N)
bits called its initial value. The processors exchange messages to compute a
function of these values. At the end of the computation, every processor has
a final value.

The transmission of a message incurs an unpredictable but finite delay,
and the state of a processor changes whenever it receives a message. At
processor y every message is placed on a queue when it arrives. Messages that
arrive simultaneously are queued arbitrarily. Messages sent on the same link
(x, y) arrive at y in the same order as they were sent.

To each processor assign an integer p, 0 ≤ p < N. For simplicity, to
obviate the phrase "mod N," also assign the integers p + N, p + 2N, ... to the
same processor p. The assignment of integers to processors is used only for
clarity of exposition; since the processors are identical, processor p does
not actually have immediate access to the number p. If integer p is assigned
to processor x and q is assigned to processor y, then the link (x, y) will be
written (p, q). The phrase "processor p" also denotes processor x.

I consider several topologies for the communication network. In a
bidirectional ring, processor p can send messages only to processors p - 1 and
p + 1. Formally, the bidirectional ring has links (p, p - 1) and (p, p + 1)
for every p. In a unidirectional ring, processor p can send messages only to
processor p + 1.

The discrete torus is a square mesh with wrap-around connections. Let
N = $M^2$. For each processor p, 0 ≤ p < N, write p = i + jM such that
0 ≤ i < M and 0 ≤ j < M. This equation defines a bijection between
{0, ..., N - 1} and pairs <i, j> in $\{0, \ldots, M - 1\}^2$. Processor p can
also be called processor <i, j>. In the discrete torus, for every i and j
there are links

    (<i, j>, <i + 1, j>), (<i, j>, <i - 1, j>),
    (<i, j>, <i, j + 1>), (<i, j>, <i, j - 1>),

where i ± 1 and j ± 1 are taken modulo M. For example, ILLIAC IV had the
topology of a discrete torus with N = 64.

A fully interconnected network has the link (x, y) for every pair of
processors x and y.
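The four topologies defined above can be written down as explicit sets of links. The following Python sketch is illustrative only: the function names are hypothetical, processors are numbered 0, ..., N - 1, and leaving self-links out of the fully interconnected network is an assumption rather than something the text states.

```python
def bidirectional_ring(n):
    """Links (p, p - 1) and (p, p + 1) for every p, indices taken mod n."""
    return {(p, (p + d) % n) for p in range(n) for d in (-1, 1)}


def unidirectional_ring(n):
    """Links (p, p + 1) for every p, indices taken mod n."""
    return {(p, (p + 1) % n) for p in range(n)}


def discrete_torus(m):
    """Square mesh with wrap-around; processor <i, j> is numbered p = i + j*m."""
    links = set()
    for i in range(m):
        for j in range(m):
            p = i + j * m
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                neighbour = (i + di) % m + ((j + dj) % m) * m
                links.add((p, neighbour))
    return links


def fully_interconnected(n):
    """Link (x, y) for every ordered pair of distinct processors (assumption)."""
    return {(x, y) for x in range(n) for y in range(n) if x != y}
```

With M = 8 the torus has N = 64 processors, the size quoted for ILLIAC IV, and each processor has four outgoing links.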

Each processor has O(lg N) bits of storage. This limit precludes trivial
algorithms. For instance, on a fully interconnected network, if processor p
had unbounded storage, then the other processors could ship their initial
values to processor p, which could compute all the final values.

The limitation on storage implies that every message has O(lg N) bits.
There are no other constraints on the form of messages; in particular, a
message need not be one of the initial values. The limit on message length
prohibits arbitrarily long messages. If messages of unbounded length were
permitted, then for every solvable problem there would be an algorithm that
used O(N) messages after it elects a leader. For example, on a unidirectional
ring N long messages would suffice to send all the initial values to the
leader, which would perform the computation, and N more long messages would
suffice to distribute the final values from the leader to the other
processors.

To evaluate the performance of a distributed algorithm, I assume that the
processing time within a processor is negligible. Indeed, because computation
within a processor generally proceeds much faster than transmission of
messages, communication steps often dominate the running time of an algorithm
(Lint and Agerwala, 1981). The two performance criteria used in this paper
are expressed as functions of N. The message complexity of an algorithm is
the maximum, over all problem instances, of the total number of messages
passed among all the processors on that problem instance. This complexity
measure provides a worst-case estimate of the communication time. Abelson
(1980) and Papadimitriou and Sipser (1982) studied a similar measure for the
number of transmitted bits. The ideal execution time is the maximum, over all
problem instances, of the amount of time the computation would take on that
problem instance if the processors were synchronised and if every message
arrived one time unit after it was sent. This measure provides a lower bound
on the communication time. In the terminology of Nassimi and Sahni (1980),
the ideal execution time is the number of unit-routes.

2.2. The Sorting Problem

Initially, for every p, processor p has a distinct initial value IV(p).
A sorting algorithm rearranges these values so that at the end of the
computation, processor p has a final value FV(p) such that for some b,

    FV(b + i) < FV(b + i + 1) for all 0 ≤ i ≤ N - 2.

Call processor b the base. The base processor has the smallest final value.

For a sorting algorithm A and a distribution of initial values, the
destination of a value v is the processor p such that at the end of the
computation of A, the final value at processor p is FV(p) = v. The
destination of a value depends on which processor becomes the base.
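The sortedness condition and the notion of a destination can be checked mechanically. The sketch below is only an illustration under simple assumptions: final values are held in a Python list indexed by processor number, and the helper names are hypothetical.

```python
def is_sorted_from(fv, b):
    """True if FV(b + i) < FV(b + i + 1) for all 0 <= i <= N - 2 (indices mod N)."""
    n = len(fv)
    return all(fv[(b + i) % n] < fv[(b + i + 1) % n] for i in range(n - 1))


def find_base(fv):
    """Return the base b if the ring is sorted from some processor, else None."""
    b = fv.index(min(fv))  # the base must hold the smallest final value
    return b if is_sorted_from(fv, b) else None


def destination(fv, v):
    """Processor whose final value is v."""
    return fv.index(v)
```

For example, the final values [30, 40, 10, 20] are sorted with base 2, and the destination of the value 40 is processor 1.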

3. Lower Bounds


3.1. Preliminaries

Let SBS(Q) denote the set of finite sequences of binary strings
$(\beta_1, \ldots, \beta_k)$ in which every component $\beta_i$ is a string
of at most lg Q bits.

Lemma 1. There are fewer than $2(2Q)^k$ sequences in SBS(Q) that have at
most k components.

Proof. Since each component in a sequence in SBS(Q) has at most lg Q
bits, the number of possible components is

$$2 + \cdots + 2^{\lg Q - 1} + 2^{\lg Q} < 2Q.$$

It follows that the number of sequences in SBS(Q) with at most k components is
smaller than

$$1 + 2Q + (2Q)^2 + \cdots + (2Q)^k < 2(2Q)^k. \qquad \Box$$
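The counting in Lemma 1 can be spot-checked for small parameters. This sketch is not part of the proof; it assumes Q is a power of 2 and, to be conservative, also counts the empty string as a possible component. The function names are hypothetical.

```python
def count_components(q):
    """Number of binary strings of length 0, 1, ..., lg q (q a power of 2)."""
    lg_q = q.bit_length() - 1
    return sum(2 ** length for length in range(lg_q + 1))


def count_sequences(q, k):
    """Number of sequences of such strings that have at most k components."""
    c = count_components(q)
    return sum(c ** n for n in range(k + 1))


if __name__ == "__main__":
    for q in (2, 4, 8, 16):
        for k in (1, 2, 3):
            print(q, k, count_sequences(q, k), 2 * (2 * q) ** k)
```

In every case printed, the exact count stays below the bound of Lemma 1.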

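Lemma 2. Let S be a set of σ different sequences in SBS(Q). When each
occurrence of a string is counted, the total number of strings among the
sequences in S is at least

$$\frac{4}{5}\,\sigma\,\frac{\lg(\sigma/10)}{\lg(2Q)}.$$

Proof. Set

$$k = 1 + \left\lfloor \frac{\lg(\sigma/10)}{\lg(2Q)} \right\rfloor.$$

By definition,

$$\lg(\sigma/10) \ge (k - 1)\lg(2Q), \qquad \sigma/5 \ge 2(2Q)^{k-1}.$$

Lemma 1 implies that at least 4/5 of the sequences in S have at least k
components each. The total number of strings among these sequences is at least

$$\frac{4}{5}\,\sigma k \ge \frac{4}{5}\,\sigma\,\frac{\lg(\sigma/10)}{\lg(2Q)}. \qquad \Box$$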

In a distributed computing system call the function p ↦ IV(p) a
distribution. If P is a set of processors in the system, then the restriction
of a distribution d to P is the distribution for P induced by d.
Distributions $d_1$ and $d_2$ agree on P if their values on P are the same;
equivalently, the restriction of $d_1$ to P is identical to the restriction of
$d_2$ to P.

Consider a partition of the processors in a system into two sets $P_1$ and
$P_2$. The cut C induced by this partition is the set of all links (x, y) for
which either x ∈ $P_1$ and y ∈ $P_2$ or x ∈ $P_2$ and y ∈ $P_1$.

Let A be a distributed algorithm that uses messages of at most c lg N
bits each. Let C be a cut. During the computation by A for a distribution d
consider the sequence of messages transmitted on links in C in the order in
which they were sent. To each message m of this sequence append a string of
2 lg N bits that identifies on which of the at most $N^2$ links in C message m
was sent. Call the resulting sequence of binary strings the signature of A
for d on C. The signature is in SBS($N^{c+2}$).

Lemma 3. Let C be a cut induced by a partition of the processors into
$P_1$ and $P_2$. Let D be a collection of distributions that agree on all
processors in $P_2$. If algorithm A has fewer than |D| different signatures
on C for the distributions in D, then for two different distributions in D,
algorithm A produces the same set of final values in $P_2$.

Proof. By hypothesis, there are different distributions $d_1$ and $d_2$ in D
for which A has the same signature on C. For both $d_1$ and $d_2$ the
computation by A sends the same messages on the same links in C in the same
order. From the viewpoint of the processors in $P_2$ the computation by A for
$d_1$ is the same as its computation for $d_2$. Consequently, at the end of
both computations the final values in $P_2$ are the same. □

3.2. Rings

This section establishes a lower bound on the message complexity of sorting
in a bidirectional ring. The lower bound applies a fortiori to unidirectional
rings too.

Theorem 1. On a bidirectional ring of N processors with initial values
in {0, ..., L}, every sorting algorithm has message complexity
$\Omega(N^2 \lg(L/N) / \lg N)$.

In particular, if L = 2N, then $\Omega(N^2 / \lg N)$ messages are necessary.
If L = N lg N, then $\Omega(N^2 \lg\lg N / \lg N)$ messages are necessary.
If L = $N^e$ for a constant e > 1, then $\Omega(N^2)$ messages are necessary.

Proof. Consider an algorithm A that arranges values into sorted order
using messages of length at most c lg N bits each. The main idea is the
following: for some distribution of initial values, no matter which processor
becomes the base, approximately N/4 initial values must migrate at least
distance N/16 to their destinations. But the destination of a value depends on
the processor that becomes the base, which in turn depends on the initial
values. The bulk of this proof overcomes this circularity.

Define R = L/N. Without loss of generality, assume that R is an integer
and that N - 1 is divisible by 16. Define a collection of $R^N$ distributions
of initial values as follows. For p = 0, ..., N - 1 the initial value at
processor p satisfies

$$
\begin{cases}
(p/2)\,R \le IV(p) < (p/2 + 1)\,R & \text{if } p \text{ is even,} \\
((N + p)/2)\,R \le IV(p) < ((N + p)/2 + 1)\,R & \text{if } p \text{ is odd.}
\end{cases}
\tag{1}
$$

Example. (N = 17, R = 3)

    IV(0) =  0    IV(5) = 33    IV(9)  = 39    IV(13) = 45
    IV(1) = 27    IV(6) =  9    IV(10) = 15    IV(14) = 21
    IV(2) =  3    IV(7) = 36    IV(11) = 42    IV(15) = 48
    IV(3) = 30    IV(8) = 12    IV(12) = 18    IV(16) = 24
    IV(4) =  6                                              □
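The example can be regenerated from definition (1) by always choosing the smallest admissible value. The sketch below is illustrative; the function names are hypothetical, and it assumes N is odd, as in the proof.

```python
def lowest_distribution(n, r):
    """Smallest values allowed by (1): (p/2) R for even p, ((N + p)/2) R for odd p."""
    return [(p // 2) * r if p % 2 == 0 else ((n + p) // 2) * r
            for p in range(n)]


def rank(iv, p):
    """Position of IV(p) in sorted order, counting from 0."""
    return sorted(iv).index(iv[p])


if __name__ == "__main__":
    iv = lowest_distribution(17, 3)
    for p, v in enumerate(iv):
        print(f"IV({p}) = {v}, rank {rank(iv, p)}")
```

The ranks interleave: even p has rank p/2 and odd p has rank (N + p)/2, which is what forces the long migrations used in the proof.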


Since the ring has N processors, there are only N possible bases. Therefore
there is a base b such that for at least $R^N/N$ of the distributions
defined by (1), processor b becomes the base during the computation by
algorithm A. Let D be this collection of $R^N/N$ distributions.

Put

$$
\begin{aligned}
q &= 6(N - 1)/16 + 1 + 2b, \\
r &= 10(N - 1)/16 + 2b, \\
s &= 11(N - 1)/16 + 1 + 2b, \\
t &= 5(N - 1)/16 + 2b.
\end{aligned}
\tag{2}
$$

Let $P_1$ be the set of (N - 1)/4 processors q, q + 1, ..., r. Let $P_2$ be
the set of 5(N - 1)/8 + 1 processors s, s + 1, ..., t. See Figure 1.


For the distributions in D processor b becomes the base. Definition (1)
implies that for every p the destination of IV(p) is

    processor b + p/2 if p is even,
    processor b + (N + p)/2 if p is odd.

It follows that for p = q, q + 1, ..., r, the destination of IV(p) is among
processors

    b + (N + q)/2 = s, s + 1, ..., b + r/2 = t.
Thus each of the initial values in $P_1$ must travel at least distance
1 + (N - 1)/16 to its destination in $P_2$.
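The migration claim can be checked numerically for small rings. The following sketch evaluates definition (2) and the destination rule, then measures ring distances. It is an illustration only: the function names are hypothetical, and the check is run only for base b = 0.

```python
def cut_parameters(n, b):
    """q, r, s, t from (2); assumes n - 1 is divisible by 16."""
    m = (n - 1) // 16
    return 6 * m + 1 + 2 * b, 10 * m + 2 * b, 11 * m + 1 + 2 * b, 5 * m + 2 * b


def destination(n, b, p):
    """Destination of IV(p) when processor b is the base (taken mod n)."""
    offset = p // 2 if p % 2 == 0 else (n + p) // 2
    return (b + offset) % n


def ring_distance(n, x, y):
    """Number of links on a shortest path between x and y on a bidirectional ring."""
    d = (x - y) % n
    return min(d, n - d)


if __name__ == "__main__":
    n, b = 17, 0
    q, r, s, t = cut_parameters(n, b)
    for p in range(q, r + 1):
        d = destination(n, b, p)
        print(f"IV({p}) -> processor {d}, distance {ring_distance(n, p, d)}")
```

For N = 17 and b = 0 this gives q = 7, r = 10, s = 12, t = 5, and every value starting in processors 7, ..., 10 ends up at distance at least 2 = 1 + (N - 1)/16.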

Let $P'_1$ be the set of (3N + 1)/4 processors that are not in $P_1$. There
are $R^{(3N+1)/4}$ distributions for $P'_1$ consistent with (1). Consequently
there is a distribution d' for $P'_1$ such that d' is induced by at least

$$\frac{R^N/N}{R^{(3N+1)/4}} = \frac{R^{(N-1)/4}}{N}$$

of the distributions in D. Let D' be this subset of at least
$R^{(N-1)/4}/N$ distributions. The distributions in D' agree on $P'_1$. Let
σ = $R^{(N-1)/4}/N$.

Consider the following 1 + (N - 1)/16 pairwise disjoint cuts, which
separate $P_1$ from $P_2$:

$$
\begin{aligned}
&\{(q, q-1),\ (q-1, q),\ (r, r+1),\ (r+1, r)\}, \\
&\{(q-1, q-2),\ (q-2, q-1),\ (r+1, r+2),\ (r+2, r+1)\},\ \ldots, \\
&\{(q - \tfrac{N-1}{16}, t),\ (t, q - \tfrac{N-1}{16}),\ (r + \tfrac{N-1}{16}, s),\ (s, r + \tfrac{N-1}{16})\}.
\end{aligned}
\tag{3}
$$

For each of the 1 + (N - 1)/16 cuts C in (3) the number of different
signatures of A on C must be at least |D'| ≥ σ because otherwise, by Lemma 3,
there would be two different distributions in D' that would yield the same set
of final values in $P_2$. Let Q = $N^{c+2}$. Let M(C, d) be the number of
messages used by A on links in C for the initial distribution d. By Lemma 2,
for each of the 1 + (N - 1)/16 cuts C in (3),

$$\sum_{d \in D'} M(C, d) \ge \frac{4}{5}\,\sigma\,\frac{\lg(\sigma/10)}{\lg(2Q)};$$

hence

$$\sum_{C \text{ in } (3)} \sum_{d \in D'} M(C, d) \ge \left(1 + \frac{N-1}{16}\right) \frac{4}{5}\,\sigma\,\frac{\lg(\sigma/10)}{\lg(2Q)}.$$

Therefore there exists a $d_1$ in D' such that

$$
\begin{aligned}
\sum_{C \text{ in } (3)} M(C, d_1)
&\ge \Bigl(\sum_{d \in D'} \sum_{C \text{ in } (3)} M(C, d)\Bigr) / \sigma \\
&\ge \frac{N + 15}{16} \cdot \frac{(N - 1)\lg R - 4\lg(10N)}{5\lg(2N^{c+2})} \\
&\ge \frac{N^2 \lg R}{80\lg(2N^{c+2})} - \frac{(N + 15)\lg(10N)}{20\lg(2N^{c+2})} \\
&= \Omega\!\left(\frac{N^2 \lg R}{\lg N}\right).
\end{aligned}
$$

Ergo, the message complexity of A is $\Omega(N^2 \lg R / \lg N)$. □


This proof resembles the proofs of Thompson (1979), who established
time-space tradeoffs in VLSI. As Lipton and Sedgewick (1981) have observed,
Thompson's technique is analogous to a crossing sequence argument for Turing
machine complexity (Hopcroft and Ullman, 1979).


3.3.  The  Discrete  Torus 

A modification of the proof of Section 3.2 yields an
$\Omega(N^{3/2} \lg(L/N) / \lg N)$ lower bound on the message complexity of
sorting on the discrete torus.

Consider a discrete torus with N processors. Let M = $N^{1/2}$. Suppose the
initial value at every processor p satisfies (1) in Section 3.2. For the q,
r, s, and t defined by (2), the values among processors q, q + 1, ..., r must
migrate to their destinations at processors s, s + 1, ..., t. Let $P_1$ be the
set of processors q, q + 1, ..., r. Let $P_2$ be the set of processors
s, s + 1, ..., t. It is easy to find a set of ⌈(N - 1)/(16 M)⌉ pairwise
disjoint cuts that separate $P_1$ and $P_2$. As in Section 3.2, for every
sorting algorithm, there is some distribution for which the algorithm uses at
least

$$
\begin{aligned}
\left\lceil \frac{N - 1}{16M} \right\rceil \cdot \frac{4\lg(\sigma/10)}{5\lg(2N^{c+2})}
&\ge \frac{N - 1}{16M} \cdot \frac{(N - 1)\lg R - 4\lg(10N)}{5\lg(2N^{c+2})} \\
&\ge \frac{(N^2 - 2N)\lg R}{80M\lg(2N^{c+2})} - \frac{(N - 1)\lg(10N)}{20M\lg(2N^{c+2})} \\
&= \Omega\!\left(\frac{N^{3/2} \lg R}{\lg N}\right)
\end{aligned}
$$

messages.

Theorem 2. On a discrete torus of N processors with initial values in
{0, ..., L}, every sorting algorithm has message complexity
$\Omega(N^{3/2} \lg(L/N) / \lg N)$.

4. Optimal Sorting

Section 4 and Section 5 present algorithms for the sorting problem. All
algorithms first employ an extrema-finding algorithm to elect a leader,
namely, the processor whose initial value is smallest. Message complexities
and ideal execution times for the extrema-finding problem are as follows:

                                    message complexity     ideal execution time

    unidirectional ring             1.44 N lg N + O(N)     2N - 1
      (Peterson, 1982)
    discrete torus                  0.72 N lg N + O(N)     [...] - 3
      (Matsushita, 1983)
    fully interconnected network    4.4 N                  2.88 lg N + O(1)
      (Matsushita, 1983)

4.1.  Representing  a  Sorted  Subset 

Let S = {$s_1$, ..., $s_k$} be a nonempty subset of {0, ..., L}. Index the
elements of S so that $s_1 < s_2 < \cdots < s_k$. Let $s_0 = 0$. The set S can
be represented by the sequence ($s_1 - s_0$, ..., $s_k - s_{k-1}$). Encode
this sequence as follows. Write each $s_j - s_{j-1}$ in binary; then replace
simultaneously

    0 by 00        1 by 01
    , by 10        ( and ) by 11.

Call this encoded result E(S). The length of E(S) is

$$\sum_{j=1}^{k} \bigl(2\lg(s_j - s_{j-1}) + O(1)\bigr) \text{ bits.}$$

By Jensen's inequality,

$$\frac{1}{k}\sum_{j=1}^{k} \lg(s_j - s_{j-1}) \le \lg\Bigl(\frac{1}{k}\sum_{j=1}^{k} (s_j - s_{j-1})\Bigr).$$

Thus the length of E(S) is at most

$$
\begin{aligned}
2\sum_{j=1}^{k} \lg(s_j - s_{j-1}) + O(k)
&\le 2k\lg\Bigl(\frac{1}{k}\sum_{j=1}^{k} (s_j - s_{j-1})\Bigr) + O(k) \\
&= 2k\lg(s_k/k) + O(k) \le 2k\lg(L/k) + O(k).
\end{aligned}
$$

If every $s_j$ were written out in binary, then S would be encoded with
k lg L + O(k) bits. When k is large, E(S) has fewer bits.

This encoding permits efficient insertion of a new value into S and
efficient deletion of the smallest value from S. To insert a value b such that
$s_i < b < s_{i+1}$, replace the encoding of $s_{i+1} - s_i$ by the encoding
of the subsequence $b - s_i$, $s_{i+1} - b$. To delete the smallest value
$s_1$, replace the encoding of the subsequence $s_1$, $s_2 - s_1$ at the
beginning of E(S) by the encoding of $s_2$.
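The encoding and its two update operations can be sketched as follows. This is a minimal illustration, not the paper's streaming procedure: it works on whole Python lists of gaps rather than on a message stream, the function names are hypothetical, and values are assumed distinct.

```python
CODE = {"0": "00", "1": "01", ",": "10", "(": "11", ")": "11"}
SYMBOL = {"00": "0", "01": "1", "10": ","}  # "11" marks a parenthesis


def gaps(values):
    """Differences between consecutive sorted values, with a leading 0 assumed."""
    a = sorted(values)
    return [a[0]] + [y - x for x, y in zip(a, a[1:])]


def encode(values):
    """E(S): the gap sequence written in binary, two bits per symbol."""
    text = "(" + ",".join(format(g, "b") for g in gaps(values)) + ")"
    return "".join(CODE[ch] for ch in text)


def decode(bits):
    """Recover the sorted values from E(S)."""
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    text = "".join(SYMBOL.get(pair, "") for pair in pairs)
    values, total = [], 0
    for g in text.split(","):
        total += int(g, 2)
        values.append(total)
    return values


def insert(gap_list, b):
    """Insert b by splitting the gap that straddles it."""
    total = 0
    for idx, g in enumerate(gap_list):
        if b < total + g:
            return gap_list[:idx] + [b - total, total + g - b] + gap_list[idx + 1:]
        total += g
    return gap_list + [b - total]  # b is larger than every element


def delete_min(gap_list):
    """Delete the smallest value by merging the first two gaps."""
    if len(gap_list) == 1:
        return []
    return [gap_list[0] + gap_list[1]] + gap_list[2:]
```

For S = {3, 9, 12, 20} the gap sequence is (3, 6, 3, 8); inserting 10 splits the gap 3 into 1 and 2, and deleting the minimum merges the leading gaps 3 and 6 into 9.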


4.2. A Sorting Algorithm on Unidirectional Rings

Consider a unidirectional ring of N processors with initial values in
{0, ..., L}. This section presents an algorithm that sorts these values by
successive insertions with $O(N^2 \lg(L/N) / \lg N)$ messages. By Theorem 1,
this algorithm is optimal.

The algorithm employs the encoding E defined in Section 4.1. Let S ⊆
{0, ..., L} have k values. The encoding E(S) is transmitted as a sequence of
messages, each of length c lg N, where c is a constant. Thus the number of
messages used to transmit E(S) is

$$\left\lceil \frac{2k\lg(L/k) + O(k)}{c\lg N} \right\rceil \le O\!\left(\frac{N\lg(L/N)}{\lg N}\right)$$

since k ≤ N ≤ L.

The algorithm comprises three phases.

During the first phase, the processors elect a leader. Without loss of
generality, assume that processor 0 is the leader.

To initiate the second phase, the leader sends E({IV(0)}) to processor 1.
For p = 0, ..., N - 1, define S(p) = {IV(0), ..., IV(p)}. In general, during
the second phase, processor p receives E(S(p - 1)) from processor p - 1 and
sends E(S(p)) to processor p + 1. Since E(S(p - 1)) is an encoding of a
sorted set, processor p need not store all of E(S(p - 1)). Rather, processor
p inserts IV(p) into S(p - 1) at the appropriate point, as described at the
end of Section 4.1. At the end of the second phase the leader receives
E(S(N - 1)).

During the third phase, the processors successively remove the smallest
value from S(N - 1). For p = 0, ..., N - 2, processor p receives an encoding
E(S) from processor p - 1. It defines FV(p) to be the smallest value in S and
sends E(S - {FV(p)}) to processor p + 1. Section 4.1 shows that the encoding
E supports efficient deletion of the smallest value in the set. Processor
N - 1 receives the largest value.
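Phases two and three can be simulated sequentially to see the data flow and to count messages. The sketch below is a simplification under stated assumptions: processor 0 is already the leader and holds the smallest value, each set is kept as a sorted Python list instead of a streamed encoding, and the message size constant c = 2 and all function names are hypothetical.

```python
from math import ceil, log2


def encoded_bits(sorted_values):
    """Length in bits of E(S): two bits per symbol of the parenthesised gap list."""
    gaps = [sorted_values[0]] + [y - x for x, y in zip(sorted_values, sorted_values[1:])]
    symbols = 2 + (len(gaps) - 1) + sum(len(format(g, "b")) for g in gaps)
    return 2 * symbols


def ring_sort(iv, c=2):
    """Return (final values, number of messages) for phases two and three."""
    n = len(iv)
    bits_per_message = c * ceil(log2(n))
    messages = 0

    # Phase two: processor p inserts IV(p) and forwards E(S(p)) to p + 1.
    s = []
    for p in range(n):
        s = sorted(s + [iv[p]])
        messages += ceil(encoded_bits(s) / bits_per_message)

    # Phase three: processor p keeps the smallest value and forwards the rest.
    fv = []
    for p in range(n - 1):
        fv.append(s[0])
        s = s[1:]
        messages += ceil(encoded_bits(s) / bits_per_message)
    fv.append(s[0])  # processor N - 1 receives the largest value
    return fv, messages


if __name__ == "__main__":
    print(ring_sort([2, 40, 7, 19, 3, 28, 11, 5]))
```

Each processor forwards one encoded set per phase, so the total message count is governed by the encoded length of the sets in transit.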

Theorem 3. On a unidirectional ring of N processors with initial values
in {0, ..., L}, suppose the election problem can be solved with p(N) messages
in ideal execution time τ(N). Then the sorting problem can be solved with
$O(N^2 \lg(L/N) / \lg N)$ + p(N) messages and ideal execution time
2N + τ(N) - 1.

Proof. Every processor transmits an encoding of a set during the second
phase and another encoding during the third phase. Therefore the algorithm
uses at most

$$O(N^2 \lg(L/N) / \lg N)$$

messages after it elects a leader.

During both the second and third phases, every processor p can transmit a
message to processor p + 1 as soon as it receives a message from processor
p - 1. The second phase runs in ideal time N. The third phase runs in ideal
time N - 1. Consequently the ideal execution time is 2N - 1 after the leader
has been elected. □


5. Other Sorting Algorithms


5.1.  The  Bidirectional  Ring 

Although the algorithm of Section 4.2 is optimal for a wide range of initial
values, the odd-even transposition sort (Knuth, 1973) can be implemented
easily on a bidirectional ring of N processors with message complexity
$O(N^2)$. The implementation has three phases.

In the first phase the processors elect a leader. Without loss of
generality, assume that processor 0 is the leader and that N is even. In the
second phase the leader initiates a message around the ring to deliver the
number N and to inform each processor whether its position p is odd or even.

In the third phase each processor executes the following program fragment.
At processor p, the initial value is IV, the final value FV. The procedure
SEND (+; J, V) sends the two-part message (J, V) to processor p + 1,
and SEND (-; J, V) sends the message (J, V) to processor p - 1. Procedure
RECEIVE (J; V) waits until a message whose first part is J has entered the
message queue; the second part of this message is assigned to the variable V.

    FV := IV;
    if p is odd then
        for J := 1 to N/2 do
            begin SEND (+; 2J - 1, FV); RECEIVE (2J - 1; V);
                  if V < FV then FV := V;
                  SEND (-; 2J, FV); RECEIVE (2J; V);
                  if V > FV then FV := V
            end
    else if p is even and p ≠ 0 then
        for J := 1 to N/2 do
            begin SEND (-; 2J - 1, FV); RECEIVE (2J - 1; V);
                  if V > FV then FV := V;
                  SEND (+; 2J, FV); RECEIVE (2J; V);
                  if V < FV then FV := V
            end
    else (* Program fragment for the leader *)
        for J := 1 to N/2 do
            begin SEND (-; 2J - 1, ∞); RECEIVE (2J - 1; V);
                  SEND (+; 2J, -∞); RECEIVE (2J; V)
            end

Sending 2J - 1 and 2J keeps the messages properly ordered. For example,
processor 3 may send a message with 2J = 10 to processor 2 before processor 2
receives a message with 2J - 1 = 9 from processor 1.
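The third phase can be simulated in lock step to check that the fragment sorts and to count messages. This sketch is an illustration under assumptions: it models the synchronised (ideal) execution rather than asynchronous queues, processor 0 is the leader and holds the smallest value, N is even, and the function name is hypothetical.

```python
import math


def odd_even_ring_sort(iv):
    """Run steps 1..N of the program fragment; return (final values, messages sent)."""
    n = len(iv)
    fv = list(iv)
    sent = 0
    for step in range(1, n + 1):  # steps 2J - 1 and 2J for J = 1..N/2
        new = list(fv)
        for p in range(n):
            # Odd p exchanges with p + 1 on odd steps and with p - 1 on even
            # steps; even p does the opposite.
            up = (p % 2 == 1) == (step % 2 == 1)
            partner = (p + 1) % n if up else (p - 1) % n
            sent += 1  # every processor sends one message per step
            if p == 0:
                continue  # the leader ignores what it receives
            v = fv[partner]
            if partner == 0:
                # The leader sends +infinity backward on odd steps and
                # -infinity forward on even steps, so neighbours keep their value.
                v = math.inf if step % 2 == 1 else -math.inf
            new[p] = min(fv[p], v) if up else max(fv[p], v)
        fv = new
    return fv, sent


if __name__ == "__main__":
    print(odd_even_ring_sort([1, 9, 4, 7, 3, 8, 2, 6]))
```

The sentinels cut the ring at the leader, so processors 1, ..., N - 1 perform an ordinary odd-even transposition sort on a line.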

Theorem 4. On a bidirectional ring with N processors suppose the election
problem can be solved with p(N) messages in ideal execution time τ(N).
Then the sorting problem can be solved with N(N + 1) + p(N) - 1 messages and
ideal execution time 2N + τ(N) - 1.

Proof. The second phase uses N - 1 messages. In the third phase every
processor sends N messages, hence this phase uses $N^2$ messages. Thus the
algorithm uses $N^2$ + N - 1 messages after electing the leader.

The second phase runs in ideal time N - 1, and the third phase runs in
ideal time N. Therefore the ideal execution time of the algorithm is
2N + τ(N) - 1. □

5.2. The Fully Interconnected Network

The well known merge-sort enjoys a straightforward implementation on a
fully interconnected network with N processors. For convenience assume that N
is a power of 2.

First, the processors elect a leader. Without loss of generality, assume
that processor 0 is the leader. Let $P_0$ be the set of processors
0, ..., N/2 - 1, and let $P_1$ be the set of processors N/2, ..., N - 1. Using
one message, the leader designates processor N/2 the temporary leader of
$P_1$. Processor N/2 initiates the merge-sort recursively to sort the initial
values in $P_1$; simultaneously processor 0 initiates the merge-sort
recursively to sort the initial values in $P_0$. When each recursive
invocation of this algorithm is completed, the final values in $P_0$ and $P_1$
are in ascending order. Processor N/2 sends the leader a message when $P_1$ is
sorted.


Next, the leader controls the merging of the values in $P_0$ and $P_1$. Each
processor k has a temporary value TV(k) that will become its new final value.
The leader executes the following algorithm, which successively compares the
final values now at processors i and j and sends the smaller to processor k.


    i := 0; j := N/2;
    DONE0 := false; DONE1 := false;
    Obtain FV(N/2) from processor N/2;
    for k := 0 to N-1 do
        if (FV(i) < FV(j) and not DONE0) or DONE1 then
            begin Send FV(i) to processor k, which sets TV(k) := FV(i);
                  i := i + 1;
                  if i < N/2 then Obtain FV(i) from processor i
                  else DONE0 := true
            end
        else if FV(i) > FV(j) or DONE0 then
            begin Send FV(j) to processor k, which sets TV(k) := FV(j);
                  j := j + 1;
                  if j < N then Obtain FV(j) from processor j
                  else DONE1 := true
            end


Finally, the leader sends a message to every processor p to set
FV(p) := TV(p).
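The leader-controlled merge can be simulated to confirm its result and its message count. The sketch is illustrative and rests on assumptions: the two halves are already sorted, each "obtain" costs one request and one reply, messages the leader would address to itself are not counted, and the function name is hypothetical.

```python
def leader_merge(fv):
    """Merge sorted halves fv[:N/2] and fv[N/2:]; return (new values, messages)."""
    n = len(fv)
    half = n // 2
    i, j = 0, half
    messages = 2  # request to processor N/2 and its reply
    tv = [None] * n
    for k in range(n):
        # Take from the lower half unless it is exhausted or its value is larger.
        take_low = j == n or (i < half and fv[i] < fv[j])
        if take_low:
            tv[k] = fv[i]
            i += 1
            more = i < half
        else:
            tv[k] = fv[j]
            j += 1
            more = j < n
        if k != 0:
            messages += 1  # leader sends the value to processor k
        if more:
            messages += 2  # leader obtains the next value from that half
    messages += n - 1  # leader tells every p != 0 to set FV(p) := TV(p)
    return tv, messages


if __name__ == "__main__":
    print(leader_merge([1, 4, 6, 9, 2, 3, 7, 8]))
```

One merge of N values costs 4(N - 1) messages in this accounting, which is the sum of the four (N - 1) terms counted for the merge step.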


Observe that the leader needs to store only one value from $P_1$ and only
one value from $P_0$ other than its own. Thus the number of bits of storage
required by the leader is O(lg N). Indeed, every processor needs only O(lg N)
bits of storage.

Let M(N) be the number of messages used by this algorithm on a fully
interconnected network with N processors, after the leader has been elected.
Then

    M(N) = 2 M(N/2)   for recursive invocations of the algorithm
           + 2        to begin and end the recursive invocations
           + (N - 1)  messages from the leader to each processor
                      p ≠ 0 to obtain FV(p)
           + (N - 1)  messages sending FV(p) to the leader
           + (N - 1)  messages from the leader to each processor
                      k ≠ 0 to set TV(k)
           + (N - 1)  messages from the leader to each processor
                      p ≠ 0 to set FV(p) := TV(p)

Thus

    M(1) = 0,
    M(N) = 2 M(N/2) + 4N - 2;

hence

    M(N) ≤ 4 N lg N.


Let T(N) be the ideal execution time of the algorithm. Then

    T(N) = T(N/2)     for recursive invocations of the algorithm
           + 2        to begin and end the recursive invocations
           + (N - 1)  for messages from the leader to each processor
                      p ≠ 0 to obtain FV(p)
           + (N - 1)  for messages sending FV(p) to the leader
           + (N - 1)  for messages from the leader to each processor
                      k ≠ 0 to set TV(k)
           + 1        for messages from the leader to each processor
                      p ≠ 0 to set FV(p) := TV(p)


Thus

    T(1) = 0,
    T(N) = T(N/2) + 3N;

hence

    T(N) < 6 N.
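Both recurrences can be evaluated directly to spot-check the closed-form bounds. The sketch below sums the itemized terms for the message count and uses the stated recurrence for the ideal execution time; N is assumed to be a power of 2 and the function names are hypothetical.

```python
def merge_sort_messages(n):
    """M(N): two recursive calls, 2 control messages, four batches of N - 1."""
    if n == 1:
        return 0
    return 2 * merge_sort_messages(n // 2) + 2 + 4 * (n - 1)


def merge_sort_ideal_time(n):
    """T(N) = T(N/2) + 3N with T(1) = 0."""
    if n == 1:
        return 0
    return merge_sort_ideal_time(n // 2) + 3 * n


if __name__ == "__main__":
    for e in range(1, 11):
        n = 2 ** e
        print(n, merge_sort_messages(n), 4 * n * e, merge_sort_ideal_time(n), 6 * n)
```

For every power of 2 tried, the message count stays at or below 4 N lg N and the ideal time stays below 6 N.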


Theorem 5. On a fully interconnected network with N processors suppose
the election problem can be solved with p(N) messages in ideal execution time
τ(N). Then the sorting problem can be solved with at most 4 N lg N + p(N)
messages and ideal execution time less than 6 N + τ(N).

The algorithm uses many messages to initiate and end recursive invocations.
These messages would be unnecessary if the system were synchronous.

The odd-even transposition sorting algorithm of Section 5.1 also runs on
a fully interconnected network. It has a smaller ideal execution time, but
uses more messages.

5.3.  The  Discrete  Torus 

The algorithm of Nassimi and Sahni (1979), when implemented asynchronously,
uses $O(N^{3/2})$ messages because each of the N processors sends at most
$O(N^{1/2})$ messages. Therefore, when the initial values are in
{0, ..., $N^e$} for some constant e > 1, this algorithm is optimal within a
constant multiplicative factor.